208 research outputs found

    TABARNAC: Tools for Analyzing Behavior of Applications Running on NUMA Architecture

    Get PDF
    In modern parallel architectures, memory accesses represent a common bottleneck. Thus, optimizing the way applications access the memory is an important way to improve performance and energy consumption. Memory accesses are even more important on NUMA machines, as the access time to data depends on its location in memory. Many efforts were made to develop adaptive tools that improve memory accesses at runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools are not able to change the memory access pattern of the original application; therefore, a code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimize it, removing the need for runtime analysis. In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by structure. Such visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. Moreover, we propose several code modifications to improve the memory access behavior of several parallel applications.
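The per-thread, per-structure access distribution that TABARNAC visualizes can be illustrated with a toy tally. This is a sketch only: TABARNAC itself instruments running binaries, and the page size and trace format below are assumptions made for illustration.

```python
from collections import Counter

PAGE_SIZE = 64  # elements per "page" in this toy model (assumption)

def access_histogram(traces):
    """Tally memory accesses per (thread, page) -- the kind of
    per-structure distribution TABARNAC visualizes."""
    hist = Counter()
    for tid, addresses in traces.items():
        for addr in addresses:
            hist[(tid, addr // PAGE_SIZE)] += 1
    return hist

# Two threads, each scanning its own half of a 256-element structure:
traces = {0: range(0, 128), 1: range(128, 256)}
hist = access_histogram(traces)
# Thread 0 touches pages 0-1 and thread 1 pages 2-3 exclusively, so each
# page could be placed on the NUMA node of its single accessor.
```

A pattern where both threads touch every page, by contrast, gives the mapper no exclusive placement to exploit, which is the kind of diagnosis the visualization makes easy.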

    TABARNAC: Visualizing and Resolving Memory Access Issues on NUMA Architectures

    Get PDF
    In modern parallel architectures, memory accesses represent a common bottleneck. Thus, optimizing the way applications access the memory is an important way to improve performance and energy consumption. Memory accesses are even more important with NUMA machines, as the access time to data depends on its location in the memory. Many efforts were made to develop adaptive tools to improve memory accesses at the runtime by optimizing the mapping of data and threads to NUMA nodes. However, these tools are not able to change the memory access pattern of the original application; therefore, a code written without considering memory performance might not benefit from them. Moreover, automatic mapping tools take time to converge towards the best mapping, losing optimization opportunities. A deeper understanding of the memory behavior can help optimizing it, removing the need for runtime analysis. In this paper, we present TABARNAC, a tool for analyzing the memory behavior of parallel applications with a focus on NUMA architectures. TABARNAC provides a new visualization of the memory access behavior, focusing on the distribution of accesses by thread and by structure. Such visualization allows the developer to easily understand why performance issues occur and how to fix them. Using TABARNAC, we explain why some applications do not benefit from data and thread mapping. Moreover, we propose several code modifications to improve the memory access behavior of several parallel applications.

    ComprehensiveBench: a Benchmark for the Extensive Evaluation of Global Scheduling Algorithms

    Get PDF
    Parallel applications that present tasks with imbalanced loads or complex communication behavior usually do not exploit the underlying resources of parallel platforms to their full potential. In order to mitigate this issue, global scheduling algorithms are employed. As finding the optimal task distribution is an NP-hard problem, identifying the most suitable algorithm for a specific scenario and comparing algorithms are not trivial tasks. In this context, this paper presents ComprehensiveBench, a benchmark for global scheduling algorithms that enables the variation of a vast range of parameters that affect performance. ComprehensiveBench can be used to assist in the development and evaluation of new scheduling algorithms, to help choose a specific algorithm for an arbitrary application, to emulate other applications, and to enable statistical tests. We illustrate its use in this paper with an evaluation of Charm++ periodic load balancers that stresses their characteristics.
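Since finding the optimal task distribution is NP-hard, global schedulers rely on heuristics. A minimal sketch of one classic greedy heuristic, Longest Processing Time first (shown here as a generic illustration, not one of Charm++'s actual balancers), makes the problem concrete:

```python
import heapq

def lpt_schedule(task_loads, n_procs):
    """Longest-Processing-Time greedy scheduling: assign each task,
    heaviest first, to the currently least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]  # (load, processor)
    heapq.heapify(heap)
    assignment = {}
    for task, load in sorted(task_loads.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(heap)   # least-loaded processor
        assignment[task] = p
        heapq.heappush(heap, (total + load, p))
    return assignment

# Five imbalanced tasks on two processors:
tasks = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 2}
sched = lpt_schedule(tasks, 2)
# LPT yields per-processor loads of 10 and 11 (makespan 11).
```

A benchmark like ComprehensiveBench varies exactly the parameters this toy fixes: the load distribution, the number of tasks and processors, and the communication behavior the heuristic ignores.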

    Energy Efficient Seismic Wave Propagation Simulation on a Low-power Manycore Processor.

    No full text
    Large-scale simulation of seismic wave propagation is an active research topic. Its high demand for processing power makes it a good match for High Performance Computing (HPC). Although we have observed a steady increase in the processing capabilities of HPC platforms, their energy efficiency still lags behind. In this paper, we analyze the use of a low-power manycore processor, the MPPA-256, for seismic wave propagation simulations. First we look at its peculiar characteristics, such as the limited amount of on-chip memory, and describe the intricate solution we brought forth to deal with this processor's idiosyncrasies. Next, we compare the performance and energy efficiency of seismic wave propagation on the MPPA-256 to other commonplace platforms such as general-purpose processors and a GPU. Finally, we wrap up with the conclusion that, even if the MPPA-256 presents an increased software development complexity, it can indeed be used as an energy-efficient alternative to current HPC platforms, resulting in up to 71% and 5.18x less energy than a GPU and a general-purpose processor, respectively.
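Seismic solvers of this kind are typically explicit finite-difference stencils marched in time; the following minimal 1-D wave-equation step is an assumption about the general shape of such a kernel, not the authors' actual 3-D code:

```python
def wave_step(u_prev, u_curr, c, dx, dt):
    """One explicit finite-difference time step of the 1-D wave
    equation u_tt = c^2 * u_xx, with fixed (zero) boundaries."""
    r2 = (c * dt / dx) ** 2  # stability requires the CFL condition c*dt/dx <= 1
    u_next = list(u_curr)    # boundaries stay as-is
    for i in range(1, len(u_curr) - 1):
        u_next[i] = (2 * u_curr[i] - u_prev[i]
                     + r2 * (u_curr[i + 1] - 2 * u_curr[i] + u_curr[i - 1]))
    return u_next

# A single pulse spreading symmetrically from the midpoint:
u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
u1 = wave_step(u0, u0, c=1.0, dx=1.0, dt=0.5)
```

The kernel's low arithmetic intensity and its need to keep three time levels of the grid resident are what make the MPPA-256's small on-chip memory the central difficulty the paper describes.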


    Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

    Get PDF
    In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output, correlating the number of corrupted elements with their spatial locality. Also, we provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. We apply the selected metrics to experimental results obtained in various radiation test campaigns for a total of more than 400 hours of beam time per device. The amount of data we gathered allows us to evaluate the error criticality of a representative set of algorithms from HPC suites. Additionally, based on the characteristics of the tested algorithms, we draw generic reliability conclusions for broader classes of codes. We show that arithmetic operations are less critical for the K40, while the Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods. Finally, iterative stencil operations seem the most reliable on both architectures. This work was supported by the STIC-AmSud/CAPES scientific cooperation program under the EnergySFE research project grant 99999.007556/2015-02, EU H2020 Programme, and MCTI/RNP-Brazil under the HPC4E Project, grant agreement n° 689772. Tested K40 boards were donated thanks to Steve Keckler, Timothy Tsai, and Siva Hari from NVIDIA.
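Two of the metrics mentioned, mismatch detection and the mean relative error over corrupted elements, can be sketched in simplified form (assuming nonzero golden outputs; the paper's exact definitions may differ in detail):

```python
def error_metrics(expected, observed):
    """Return (mismatch count, mean relative error over corrupted
    elements). Simplified versions of two criticality metrics:
    assumes the golden ('expected') values are nonzero."""
    mismatches = [(e, o) for e, o in zip(expected, observed) if e != o]
    if not mismatches:
        return 0, 0.0
    mre = sum(abs(o - e) / abs(e) for e, o in mismatches) / len(mismatches)
    return len(mismatches), mre

# A 3-element output with two corrupted values of very different magnitude:
n_bad, mre = error_metrics([1.0, 2.0, 4.0], [1.0, 2.2, 3.0])
```

The point the abstract makes is visible even in this toy: the same mismatch count can hide very different error magnitudes, which is why the mean relative error is reported alongside it.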

    Análise de desempenho do CRAY Y-MP 2E/232

    Get PDF
    This text presents a performance analysis of the CRAY Y-MP 2E/232 currently installed at the Centro Nacional de Supercomputação (CESUP), managed by the Universidade Federal do Rio Grande do Sul (UFRGS). Basically, the results obtained in several tests performed during the execution of a highly vectorizable program are described and analyzed. The text describes the programs used in these tests and examines several questions related to vectorization. During this discussion, several aspects that affect the performance of the CRAY are covered, among them: the order of the loop indices in matrix processing, incomplete filling of the vector registers, the way vectors are allocated in memory, and the size of the vectors processed. Eje: Paralelismo. Red de Universidades con Carreras en Informática (RedUNCI)
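The loop-index-order effect mentioned above can be shown schematically. The sketch below only illustrates the two traversal orders; on the CRAY the cost difference comes from vector register loads and memory-bank strides, which plain Python lists do not reproduce:

```python
def sum_row_major(m):
    """Row-first traversal: for a row-major (C-order) matrix,
    consecutive accesses are contiguous in memory."""
    total = 0.0
    for row in m:
        for x in row:
            total += x
    return total

def sum_col_major(m):
    """Column-first traversal: same result, but strided accesses,
    which can be far more expensive on vector and cached machines."""
    total = 0.0
    for j in range(len(m[0])):
        for i in range(len(m)):
            total += m[i][j]
    return total

m = [[1.0, 2.0], [3.0, 4.0]]
```

In Fortran, the language of most CRAY codes of that era, arrays are column-major, so the favorable index order is the opposite of the C convention shown here.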

    Segmentação de imagens ecocardiográficas utilizando mapas de Kohonen

    Get PDF
    This article presents the use of Kohonen maps for the segmentation of medical images. The images used are echocardiograms of human fetuses. These exams are of great importance, since they indicate whether or not a fetus will develop cardiac problems during its evolution. The segmentation process assists in recognizing the borders of the heart, making it possible to treat serious problems well in advance and to avoid dangerous situations later on. To assist in the task of diagnosing cardiac problems, segmentation of the echocardiographic image is the most suitable method, as it is able to delimit the cavities of the heart. However, the so-called conventional methods currently cannot perform this task satisfactorily, due to characteristics intrinsic to echocardiographic images, such as their limited sharpness. As an alternative, we propose the use of Kohonen maps [KOH89, KOH90]. Kohonen maps are organized structures, generally in matrix form, that are able to perform tasks similar to those of the human brain. Just as the brain has regions responsible for speech, hearing, and so on, the maps group knowledge into regions. One can say that Kohonen maps perform a dimensional reduction of the problem to two dimensions, in the case of two-dimensional maps. After training the map by presenting samples of echocardiographic images, distinct regions emerge, each one able to recognize different structures of the heart. From the trained map, it is then necessary to identify the regions that appear in it. For this, the image clustering method proposed by Coleman and Andrews [COL79] is used. This method determines the best number of clusters by evaluating a cluster quality parameter (β); the criterion used is the product of the between-cluster and within-cluster scatter matrices.
The results obtained are good with respect to both quality and processing time, and the resulting images show the cardiac cavities clearly delimited. Computación Gráfica y Visualización. Red de Universidades con Carreras en Informática (RedUNCI)
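The training procedure described, pulling the best-matching unit and its grid neighbors toward each presented sample, can be sketched with a minimal generic self-organizing map. The learning rate, neighborhood radius, and map size below are arbitrary illustrative choices, not the paper's configuration:

```python
import math
import random

def train_som(samples, width, height, dim, epochs=10, seed=0):
    """Minimal Kohonen self-organizing map: for each sample, find the
    best-matching unit (BMU) and pull it and its grid neighbors toward
    the sample, with decaying learning rate and radius."""
    rng = random.Random(seed)
    weights = [[[rng.random() for _ in range(dim)]
                for _ in range(width)] for _ in range(height)]
    for t in range(epochs):
        lr = 0.5 * (1 - t / epochs)                     # decaying learning rate
        radius = max(1.0, (width / 2) * (1 - t / epochs))
        for s in samples:
            # BMU = grid node whose weight vector is closest to the sample
            by, bx = min(((y, x) for y in range(height) for x in range(width)),
                         key=lambda p: sum((weights[p[0]][p[1]][d] - s[d]) ** 2
                                           for d in range(dim)))
            for y in range(height):
                for x in range(width):
                    d2 = (y - by) ** 2 + (x - bx) ** 2
                    if d2 <= radius ** 2:
                        h = math.exp(-d2 / (2 * radius ** 2))  # neighborhood kernel
                        for d in range(dim):
                            weights[y][x][d] += lr * h * (s[d] - weights[y][x][d])
    return weights

# Toy 1-D "pixel intensities" drawn from two distinct tissue classes:
w = train_som([[0.05], [0.95]], width=3, height=3, dim=1)
```

After training, distinct map regions respond to distinct input classes; identifying those regions is the step the abstract delegates to the Coleman-Andrews clustering criterion.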